Skip to content

Prefer chat_template.jinja #205

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 2, 2025

Conversation

DePasqualeOrg
Copy link
Contributor

This still needs to be tested.

@DePasqualeOrg
Copy link
Contributor Author

I tested this and verified with debug print statements that it works. If someone can provide an example of a repo that has a chat template stored in chat_template.jinja and not in tokenizer_config.json, I can add a test.

While I was testing this in the LLMEval app in mlx-swift-examples, I got some build errors in updateTokenizerConfig. The app builds when I comment out this function and the line on which it is called. I don't understand what the function is doing, why these errors now appear, or how to resolve them (cc @davidkoski).

@DePasqualeOrg DePasqualeOrg marked this pull request as ready for review May 30, 2025 19:47
@davidkoski
Copy link
Contributor

davidkoski commented May 30, 2025

While I was testing this in the LLMEval app in mlx-swift-examples, I got some build errors in updateTokenizerConfig. The app builds when I comment out this function and the line on which it is called. I don't understand what the function is doing, why these errors now appear, or how to resolve them (cc @davidkoski).

What kind of errors did you see? This function provides some overrides for the swift-transformers tokenizer configuration. e.g. for models that are not directly supported in swift-transformers. From swift-transformers:

struct TokenizerModel {
    static let knownTokenizers: [String: PreTrainedTokenizerModel.Type] = [
        "BertTokenizer": BertTokenizer.self,
        "DistilbertTokenizer": BertTokenizer.self,
        "DistilBertTokenizer": BertTokenizer.self,
        "RobertaTokenizer": BPETokenizer.self,
        "CodeGenTokenizer": CodeGenTokenizer.self,
        "CodeLlamaTokenizer": CodeLlamaTokenizer.self,
        "FalconTokenizer": FalconTokenizer.self,
        "GemmaTokenizer": GemmaTokenizer.self,
        "GPT2Tokenizer": GPT2Tokenizer.self,
        "LlamaTokenizer": LlamaTokenizer.self,
        "T5Tokenizer": T5Tokenizer.self,
        "WhisperTokenizer": WhisperTokenizer.self,
        "CohereTokenizer": CohereTokenizer.self,
        "Qwen2Tokenizer": Qwen2Tokenizer.self,
        "PreTrainedTokenizer": BPETokenizer.self,
    ]

Anyway, this method will substitute names:

    private var replacementTokenizers = [
        "`InternLM2Tokenizer`": "PreTrainedTokenizer",
        "Qwen2Tokenizer": "PreTrainedTokenizer",
        "Qwen3Tokenizer": "PreTrainedTokenizer",
        "CohereTokenizer": "PreTrainedTokenizer",
    ]

Qwen2Tokenizer and CohereTokenizer are in the knownTokenizers list so they could be removed here. Note: despite the fact that they show named types for those tokenizers, they are just empty subclasses of BPETokenizer so this currently works.

Anyway, let me know what the error looks like and I can help figure it out!

@DePasqualeOrg
Copy link
Contributor Author

These are the errors that I'm getting:

private func updateTokenizerConfig(_ tokenizerConfig: Config) -> Config {
    // workaround: replacement tokenizers for unhandled values in swift-transform
    if let tokenizerClass = tokenizerConfig.tokenizerClass?.stringValue,
        let replacement = replacementTokenizers[tokenizerClass] // Error: Cannot convert value of type 'Config' to expected argument type 'String'
    {
        var dictionary = tokenizerConfig.dictionary // Error: Ambiguous use of 'dictionary'
        dictionary["tokenizer_class"] = replacement
        return Config(dictionary)
    }

    return tokenizerConfig
}

You can test it with this branch: https://github.com/DePasqualeOrg/mlx-swift-examples/tree/test-chat-template-jinja

@davidkoski
Copy link
Contributor

OK, it looks like the API for Config changed in  02b828d.

I think this is the equivalent -- just need to call the new methods:

private func updateTokenizerConfig(_ tokenizerConfig: Config) -> Config {
    // workaround: replacement tokenizers for unhandled values in swift-transform
    if let tokenizerClass = tokenizerConfig.tokenizerClass?.string(),
        let replacement = replacementTokenizers[tokenizerClass]
    {
        if var dictionary = tokenizerConfig.dictionary() {
            dictionary["tokenizer_class"] = .init(replacement)
            return Config(dictionary)
        }
    }

    return tokenizerConfig
}

@DePasqualeOrg
Copy link
Contributor Author

Great, thank you! I created this PR, which can be merged after swift-transformers creates a new version tag and that is picked up in mlx-swift-examples.

ml-explore/mlx-swift-examples#322

@FL33TW00D
Copy link
Collaborator

@DePasqualeOrg created a repo with only a chat_template.jinja to ensure that it works. Looks good!

@DePasqualeOrg
Copy link
Contributor Author

In the repo that you created, maybe you can use a different chat template in tokenizer_config.json that would produce a different rendered prompt than the one in chat_template.jinja.

@pcuenca
Copy link
Member

pcuenca commented Jun 2, 2025

use a different chat template

That's a good idea, I agree.

@FL33TW00D
Copy link
Collaborator

Good thinking @DePasqualeOrg, switched it out here: https://huggingface.co/FL33TW00D-HF/jinja-test/blob/main/tokenizer_config.json#L232

@FL33TW00D
Copy link
Collaborator

Happy to merge if you guys are @pcuenca @DePasqualeOrg

Copy link
Member

@pcuenca pcuenca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks a lot @DePasqualeOrg!

Regarding the breaking change in a previous PR, let me try to suggest some synonyms in a new PR, coming up shortly.

@pcuenca pcuenca merged commit c2f302a into huggingface:main Jun 2, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants